This section loads the necessary R packages for data manipulation, visualization, and other tasks. If a package is not installed, it installs it first and then loads it.
This section for loading the dataset from the provided URL and displaying a summary of the data.
Here are some of the key attributes:
StichtagDatJahr: Year of the data record HalterId: Identifier for the pet owner AlterV10Cd, AlterV10Lang, AlterV10Sort: Codes, descriptions, and sorting for the age group of the pet owner SexCd, SexLang, SexSort: Codes, descriptions, and sorting for the gender of the pet owner KreisCd, KreisLang, KreisSort: Codes, descriptions, and sorting for the district QuarCd, QuarLang, QuarSort: Codes, descriptions, and sorting for the quarter Rasse1Text, Rasse2Text: Primary and secondary breed texts of the pet RasseMischlingCd, RasseMischlingLang, RasseMischlingSort: Codes, descriptions, and sorting for whether the pet is a mixed breed RassentypCd, RassentypLang, RassentypSort: Codes, descriptions, and sorting for the type of breed GebDatHundJahr: Year of birth of the pet AlterVHundCd, AlterVHundLang, AlterVHundSort: Codes, descriptions, and sorting for the age group of the pet SexHundCd, SexHundLang, SexHundSort: Codes, descriptions, and sorting for the gender of the pet HundefarbeText: Color of the pet AnzHunde: Number of dogs
In this section, we duplicate and rename the dataframe df as df_EN for the English version. Then, translations for column names in English are defined. Following this, a function is employed to replace multiple patterns at once for content translation. Patterns and replacements for content translation, including translations for age groups, sexes, breed types, and dog colors, are defined. After applying the translation function across all columns, dog colors are also translated.
In the following R code snippet, we implement a method to distinguish unique OwnerId values within our dataset. By marking the initial occurrence of each OwnerId as unique, we facilitate further analyses that may require the identification of distinct entries.
df_EN$unique_OwnerId <- !duplicated(df_EN$OwnerId)
View(df_EN)
The R code below demonstrates the process of extracting a subset of relevant columns from our comprehensive dataset df_EN, thereby creating a streamlined DataFrame, new_df. This subset includes essential fields such as KeyDateYear, OwnerId, and details regarding the dogs such as PrimaryBreed and DogBirthYear. Additionally, the code converts the NumberOfDogs column from its original format to a numeric type, ensuring that subsequent data analysis can utilize numerical operations.
# Create a new DataFrame with selected columns and convert 'NumberOfDogs' to numeric
new_df <- df_EN %>%
select(KeyDateYear, OwnerId, OwnerAgeGroup, OwnerSex, DistrictSort, QuarCd, PrimaryBreed, SecondaryBreed, MixedBreed, BreedType, DogBirthYear, DogSex, NumberOfDogs, unique_OwnerId) %>%
mutate(NumberOfDogs = as.numeric(as.character(NumberOfDogs)))
View(new_df)
This series of R code snippets delves into the examination of key features within the new_df DataFrame, focusing on the identification and analysis of unique entries for KeyDateYear, OwnerId, and OwnerAgeGroup. Each code section is designed to extract unique values, count these entries, and where applicable, visualize the distribution. Such analysis is integral for understanding the dataset’s diversity across different dimensions, helping to highlight temporal coverage, ownership uniqueness, and demographic variations among owners.
# Extract and count unique years
unique_years <- unique(new_df$KeyDateYear)
number_of_unique_years <- length(unique_years)
print(number_of_unique_years)
## [1] 9
print(unique_years)
## [1] "2015" "2016" "2017" "2018" "2019" "2020" "2021" "2022" "2023"
# Extract and count unique Owner IDs
unique_Owner <- unique(new_df$OwnerId)
number_of_unique_Owner <- length(unique_Owner)
print(number_of_unique_Owner)
## [1] 15504
# Aggregate unique Owner IDs by Age Group and Year correctly with n_distinct
unique_owner_counts <- new_df %>%
group_by(KeyDateYear, OwnerAgeGroup) %>%
summarise(UniqueOwnerCount = n_distinct(OwnerId), .groups = 'drop') # Count unique IDs per group per year
# Print the aggregated results
print(unique_owner_counts)
## # A tibble: 90 × 3
## KeyDateYear OwnerAgeGroup UniqueOwnerCount
## <chr> <chr> <int>
## 1 2015 10 to 19 years old 23
## 2 2015 20 to 29 years old 604
## 3 2015 30 to 39 years old 1173
## 4 2015 40 to 49 years old 1287
## 5 2015 50 to 59 years old 1375
## 6 2015 60 to 69 years old 954
## 7 2015 70 to 79 years old 647
## 8 2015 80 to 89 years old 191
## 9 2015 90 to 99 years old 19
## 10 2015 Unknown 43
## # ℹ 80 more rows
# Adjust factor levels in the aggregated data before plotting
unique_owner_counts <- unique_owner_counts %>%
arrange(desc(UniqueOwnerCount)) %>%
mutate(OwnerAgeGroup = fct_inorder(OwnerAgeGroup))
# Get a list of unique years
years <- unique(unique_owner_counts$KeyDateYear)
# Loop through each year and create a plot, ordering by max number in the group
for (year in years) {
# Filter data for the specific year
data_for_year <- filter(unique_owner_counts, KeyDateYear == year)
# Create the plot
p <- ggplot(data_for_year, aes(x = OwnerAgeGroup, y = UniqueOwnerCount, fill = OwnerAgeGroup)) +
geom_bar(stat = "identity", position = "dodge") +
geom_hline(yintercept = c(100, 500, 1000, 1500, 2000), linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = paste("Unique Owner IDs by Age Group in", year),
x = "Owner Age Group",
y = "Count of Unique Owner IDs") +
scale_fill_brewer(palette = "Paired") +
scale_y_continuous(limits = c(0, 2500), breaks = seq(0, 2500, by = 500)) + # Setting uniform y-axis scale
scale_x_discrete(labels = function(x) {
# Remove age numbers and the word "Unknown"
x <- gsub("[0-9]+ to [0-9]+ years old", "", x)
gsub("Unknown", "", x)
})
# Print the plot
print(p)
}
# Aggregate unique Owner IDs by Age Group, Year, and Gender
unique_owner_counts_gender <- new_df %>%
group_by(KeyDateYear, OwnerAgeGroup, OwnerSex) %>%
summarise(UniqueOwnerCountGender = n_distinct(OwnerId), .groups = 'drop') # Count unique IDs per group per year by gender
# Print the aggregated results
print(unique_owner_counts)
## # A tibble: 90 × 3
## KeyDateYear OwnerAgeGroup UniqueOwnerCount
## <chr> <fct> <int>
## 1 2023 30 to 39 years old 2376
## 2 2022 30 to 39 years old 2180
## 3 2023 40 to 49 years old 1930
## 4 2021 30 to 39 years old 1891
## 5 2022 40 to 49 years old 1813
## 6 2023 50 to 59 years old 1689
## 7 2022 50 to 59 years old 1688
## 8 2021 40 to 49 years old 1631
## 9 2020 30 to 39 years old 1607
## 10 2021 50 to 59 years old 1598
## # ℹ 80 more rows
print(unique_owner_counts_gender)
## # A tibble: 180 × 4
## KeyDateYear OwnerAgeGroup OwnerSex UniqueOwnerCountGender
## <chr> <chr> <chr> <int>
## 1 2015 10 to 19 years old female 17
## 2 2015 10 to 19 years old male 6
## 3 2015 20 to 29 years old female 467
## 4 2015 20 to 29 years old male 137
## 5 2015 30 to 39 years old female 797
## 6 2015 30 to 39 years old male 376
## 7 2015 40 to 49 years old female 838
## 8 2015 40 to 49 years old male 449
## 9 2015 50 to 59 years old female 941
## 10 2015 50 to 59 years old male 434
## # ℹ 170 more rows
# Get a list of unique years
years <- unique(unique_owner_counts$KeyDateYear)
# Loop through each year and create a plot with consistent y-axis scale
for (year in years) {
# Filter data for the specific year for both total counts and gender-specific counts
data_for_year_total <- filter(unique_owner_counts, KeyDateYear == year)
data_for_year_gender <- filter(unique_owner_counts_gender, KeyDateYear == year)
# Create the plot with bars for totals and points for gender
p <- ggplot(data_for_year_total, aes(x = OwnerAgeGroup, y = UniqueOwnerCount)) +
geom_bar(stat = "identity", position = "dodge", aes(fill = OwnerAgeGroup)) +
geom_point(data = data_for_year_gender,
aes(x = OwnerAgeGroup, y = UniqueOwnerCountGender, group = OwnerSex, color = OwnerSex),
position = position_dodge(width = 0.9), size = 3) +
geom_hline(yintercept = c(100, 500, 1000, 1500), linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = paste("Unique Owner IDs by Age Group and Gender in", year),
x = "Owner Age Group",
y = "Count of Unique Owner IDs") +
scale_fill_brewer(palette = "Paired") + # Color for bars
scale_color_manual(values = c("female" = "pink", "male" = "blue")) + # Color for points
scale_y_continuous(limits = c(0, 2500), breaks = seq(0, 2500, by = 500)) +
scale_x_discrete(labels = function(x) {
# Remove age numbers and the word "Unknown"
x <- gsub("[0-9]+ to [0-9]+ years old", "", x)
gsub("Unknown", "", x)
})
# Print the plot
print(p)
}
# Adjust factor levels in the aggregated data before plotting
unique_owner_counts <- unique_owner_counts %>%
arrange(desc(UniqueOwnerCount)) %>%
mutate(OwnerAgeGroup = fct_inorder(OwnerAgeGroup),
KeyDateYear = as.numeric(as.character(KeyDateYear))) # Convert KeyDateYear to numeric
# Create the line plot for all years with a line per age group
p <- ggplot(unique_owner_counts, aes(x = KeyDateYear, y = UniqueOwnerCount, group = OwnerAgeGroup, color = OwnerAgeGroup)) +
geom_line(size = 1) + # Add line
geom_point(size = 3) + # Add points
geom_hline(yintercept = c(100, 500, 1000, 1500, 2000), linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = "Unique Owner IDs by Age Group Over Years",
x = "Year",
y = "Count of Unique Owner IDs") +
scale_color_brewer(palette = "Paired") + # Use color to differentiate lines
scale_y_continuous(limits = c(0, 2500), breaks = seq(0, 2500, by = 500)) +
scale_x_continuous(breaks = seq(min(unique_owner_counts$KeyDateYear), max(unique_owner_counts$KeyDateYear), by = 1)) # Define breaks in the x-axis scale to show each year
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Print the plot
print(p)
After confirming successful conversion, we aggregated the data to compute the total number of dogs per year. The resulting counts were then visualized using histograms to illustrate the distribution over the years.
Furthermore, to understand the trend in dog population over time, we calculated the percentage change between consecutive years. This allowed us to identify any notable fluctuations or patterns in the data.
new_df <- new_df %>%
mutate(NumberOfDogs = as.numeric(NumberOfDogs))
# Check for any conversion problems
sum(is.na(new_df$NumberOfDogs))
## [1] 0
new_df$KeyDateYear <- as.numeric(as.character(new_df$KeyDateYear))
# Aggregate data to get total number of dogs per year
yearly_dog_counts <- new_df %>%
group_by(KeyDateYear) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop')
print(yearly_dog_counts)
## # A tibble: 9 × 2
## KeyDateYear TotalDogs
## <dbl> <dbl>
## 1 2015 6980
## 2 2016 6930
## 3 2017 7155
## 4 2018 7400
## 5 2019 7647
## 6 2020 7841
## 7 2021 8574
## 8 2022 9173
## 9 2023 9512
# Create the histogram
ggplot(yearly_dog_counts, aes(x = KeyDateYear, y = TotalDogs)) +
geom_col(fill = "skyblue", color = "black") + # Using geom_col for clarity
geom_hline(yintercept = c(2500, 5000, 7500), linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = "Total Number of Dogs per Year",
x = "Year",
y = "Total Number of Dogs") +
scale_x_continuous(breaks = yearly_dog_counts$KeyDateYear,
labels = yearly_dog_counts$KeyDateYear) +
scale_y_continuous(labels = scales::comma,
breaks = seq(0, 10000, by = 1000),
limits = c(0, 10000)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
### Calculate the PERCENTAGE CHANGE for annotations ###
yearly_dog_counts <- yearly_dog_counts %>%
arrange(KeyDateYear) %>%
mutate(Change = c(NA, diff(TotalDogs)),
PercentChange = Change / lag(TotalDogs) * 100)
# Create the base plot
ggplot(yearly_dog_counts, aes(x = KeyDateYear, y = TotalDogs)) +
geom_col(fill = "skyblue", color = "black") +
geom_hline(yintercept = c(2500, 5000, 7500), linetype = "dashed", color = "red") + # Dashed lines at specified counts
geom_smooth(method = "lm", color = "red", linetype = "dashed", se = FALSE) + # Add a linear trend line
geom_text(data = yearly_dog_counts, aes(label = sprintf("%.1f%%", PercentChange)),
vjust = -1.5, hjust = 0.5, color = "darkgreen", size = 3.5) +
theme_minimal() +
labs(title = "Total Number of Dogs per Year",
x = "Year",
y = "Total Number of Dogs") +
scale_x_continuous(breaks = yearly_dog_counts$KeyDateYear) +
scale_y_continuous(labels = scales::comma,
breaks = seq(0, 10000, by = 1000), # Adjust y-axis to have units of 1000
limits = c(0, 10000)) + # Ensure y-axis goes up to 10000
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom")
## `geom_smooth()` using formula = 'y ~ x'
We explore the unique owners per year based on gender. The analysis starts by counting each owner once per year, regardless of the number of dogs they own.
# Step 1: Count each owner once per year regardless of the number of dogs
yearly_dog_counts_by_gender_unique <- new_df %>%
select(KeyDateYear, OwnerId, OwnerSex) %>%
distinct(KeyDateYear, OwnerId, OwnerSex) %>%
group_by(KeyDateYear, OwnerSex) %>%
summarize(UniqueOwners = n(), .groups = 'drop')
print(yearly_dog_counts_by_gender_unique)
## # A tibble: 18 × 3
## KeyDateYear OwnerSex UniqueOwners
## <dbl> <chr> <int>
## 1 2015 female 4276
## 2 2015 male 2040
## 3 2016 female 4263
## 4 2016 male 2012
## 5 2017 female 4413
## 6 2017 male 2035
## 7 2018 female 4581
## 8 2018 male 2096
## 9 2019 female 4746
## 10 2019 male 2183
## 11 2020 female 4882
## 12 2020 male 2261
## 13 2021 female 5379
## 14 2021 male 2483
## 15 2022 female 5825
## 16 2022 male 2639
## 17 2023 female 6045
## 18 2023 male 2776
# Step 2: Create histograms for male and female owners without repetition
ggplot(yearly_dog_counts_by_gender_unique, aes(x = KeyDateYear, y = UniqueOwners, fill = OwnerSex)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme_minimal() +
labs(title = "Unique Owners per Year by Gender",
x = "Year",
y = "Number of Unique Owners") +
scale_fill_manual(values = c("male" = "blue", "female" = "pink")) +
geom_hline(yintercept = c(2500, 5000, 7500), linetype = "dashed", color = "red") +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(breaks = yearly_dog_counts_by_gender_unique$KeyDateYear) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We proceeded to create a heatmap illustrating the distribution of dogs based on the gender and age group of their owners across different years. First, we organized the data by grouping it according to the year, owner’s age group, and gender. Then, we iterated through each year, generating separate heatmaps to visualize the data for that specific year. Each heatmap represents the total number of dogs in various age groups categorized by the gender of their owners. The color gradient within the heatmap indicates the intensity of dog ownership, with warmer colors representing higher dog counts.
# Adjust the aggregation to count each owner once per age group and gender, per year
owner_counts <- new_df %>%
select(KeyDateYear, OwnerAgeGroup, OwnerSex, OwnerId) %>%
distinct(KeyDateYear, OwnerAgeGroup, OwnerSex, OwnerId) %>%
group_by(KeyDateYear, OwnerAgeGroup, OwnerSex) %>%
summarize(UniqueOwners = n(), .groups = 'drop')
# Loop through each year and create a heatmap
years <- unique(owner_counts$KeyDateYear)
for (year in years) {
yearly_data <- filter(owner_counts, KeyDateYear == year)
p <- ggplot(yearly_data, aes(x = OwnerSex, y = OwnerAgeGroup, fill = UniqueOwners)) +
geom_tile() +
scale_fill_gradientn(colors = brewer.pal(11, "Spectral"), limits = c(0, max(yearly_data$UniqueOwners, na.rm = TRUE)), name = "Total Dogs") +
theme_minimal() +
labs(title = paste("Heatmap of Unique Owners by Gender and Age Group in", year),
x = "Owner's Gender",
y = "Owner's Age Group",
fill = "Number of Unique Owners") +
theme(axis.text.y = element_text(angle = 45, hjust = 1))
print(p)
}
# 10. Total Count of Dogs by District
We focus on examining the annual distribution of dog populations across various districts. To achieve this, we have implemented a specific R script that systematically processes and visualizes data for each year from our dataset.
# Adjusting DistrictSort to have levels from 1 to 12 as indicated
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))
# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)
# Iterate over each year and create a bar plot for districts 1 through 12, excluding NAs
for (year in unique_years) {
yearly_data <- new_df %>%
filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
group_by(DistrictSort) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
arrange(desc(TotalDogs)) # Arranging data by TotalDogs in descending order
# Plot for the year with districts sorted by the number of dogs
p <- ggplot(yearly_data, aes(x = reorder(DistrictSort, -TotalDogs), y = TotalDogs, fill = DistrictSort)) +
geom_col() + # Using geom_col for bar plots
geom_hline(yintercept = c(100, 500, 1000, 1500), linetype = "dashed", color = "red") + # Adding dashed lines at specified y-values
scale_fill_viridis_d(name = "District") +
scale_y_continuous(limits = c(0, 2000), breaks = seq(0, 2000, by = 500)) + # Standardize y-axis up to 2000
theme_minimal() +
labs(title = paste("Total Count of Dogs by District in", year),
x = "District",
y = "Total Number of Dogs") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "none") # Removing legend for clarity
# Display the plot
print(p)
}
To enhance our understanding of the distribution of dogs across different districts and to introduce a gender perspective into our analysis, we have modified our approach to include a breakdown by gender. This adjustment allows us to observe not only the geographical distribution but also gender dynamics within the dog population each year.
# Ensure DistrictSort is a factor with levels from 1 to 12
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))
# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)
# Iterate over each year and create a bar plot for districts 1 through 12, excluding NAs
for (year in unique_years) {
yearly_data <- new_df %>%
filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
group_by(DistrictSort, DogSex) %>% # Group by district and dog sex
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
arrange(desc(TotalDogs)) # Arrange by TotalDogs in descending order for each group
# Reorder DistrictSort based on TotalDogs for clearer visualization
yearly_data$DistrictSort <- factor(yearly_data$DistrictSort, levels = unique(yearly_data$DistrictSort))
# Plot for the year with districts 1 through 12, differentiating by gender
p <- ggplot(yearly_data, aes(x = DistrictSort, y = TotalDogs, fill = DogSex)) +
geom_col(position = position_dodge()) + # Using geom_col with dodge to separate male and female bars
geom_hline(yintercept = c(100, 500), linetype = "dashed", color = "red") + # Adding dashed lines
scale_fill_viridis_d(name = "Dog Gender") +
scale_y_continuous(limits = c(0, 800), breaks = seq(0, 800, by = 100)) +
theme_minimal() +
labs(title = paste("Total Count of Dogs by District and Gender in", year),
x = "District",
y = "Total Number of Dogs") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "bottom") # Adjust legend position for clarity
print(p) # Display the plot
}
# Ensure DistrictSort is a factor with levels from 1 to 12 in new_df
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))
# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)
# Iterate over each year and create a bar plot for districts 1 through 12, excluding NAs in DistrictSort
for (year in unique_years) {
# Summarize total dogs by district
yearly_total <- new_df %>%
filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
group_by(DistrictSort) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
arrange(desc(TotalDogs)) # Arrange by TotalDogs in descending order for clarity in visualization
# Summarize total dogs by district and gender
yearly_gender <- new_df %>%
filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
group_by(DistrictSort, OwnerSex) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
ungroup() %>%
mutate(Position = as.numeric(DistrictSort) + ifelse(OwnerSex == "female", -0.2, 0.2)) # Adjust positions for clarity
# Count unique owners by gender
yearly_owner_count <- new_df %>%
filter(KeyDateYear == year) %>%
distinct(OwnerId, OwnerSex) %>%
group_by(OwnerSex) %>%
summarize(UniqueOwners = n(), .groups = 'drop')
# Create the plot
p <- ggplot() +
geom_bar(data = yearly_total, aes(x = DistrictSort, y = TotalDogs, fill = DistrictSort), stat = "identity") +
geom_point(data = yearly_gender, aes(x = Position, y = TotalDogs, color = OwnerSex), size = 3) +
geom_text(data = yearly_owner_count, aes(x = Inf, y = Inf, label = paste(UniqueOwners, "unique owner(s)"), hjust = 1, vjust = 1), size = 3, color = "black") + # Add text for unique owner count
geom_hline(yintercept = c(100, 500, 1000, 1500), linetype = "dashed", color = "red") + # Adding dashed lines at specified y-values
scale_fill_viridis_d() +
scale_color_manual(values = c("female" = "pink", "male" = "blue")) +
scale_y_continuous(limits = c(0, 2000), breaks = seq(0, 2000, by = 500)) + # Standardize y-axis up to 2000
theme_minimal() +
labs(title = paste("Total Count of Dogs by District in", year, "— Female vs Male"),
subtitle = "Bar: Total Count | Points: Count by Gender",
x = "District",
y = "Total Count of Dogs") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p)
}
In order to deepen our analysis of dog populations across different districts annually, our R script now incorporates an additional layer of granularity by assessing dog counts not only by district but also by breed type. This enhancement aims to provide a more detailed view of the diversity within the canine populations across the various districts each year.
# Adjusting DistrictSort to have levels from 1 to 12 as initially indicated
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))
# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)
# Iterate over each year and create a bar plot for districts 1 through 12
for (year in unique_years) {
# Summarize total dogs by district
yearly_total <- new_df %>%
filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
group_by(DistrictSort) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop')
# Summarize total dogs by district and breed type, excluding "Unknown" breed type
yearly_breed <- new_df %>%
filter(KeyDateYear == year, !is.na(DistrictSort), BreedType != "Unknown") %>%
group_by(DistrictSort, BreedType) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
ungroup() %>%
mutate(Position = as.numeric(DistrictSort)) # Position adjustment can be refined as needed
# Create the plot
p <- ggplot() +
geom_bar(data = yearly_total, aes(x = DistrictSort, y = TotalDogs, fill = DistrictSort), stat = "identity") +
geom_point(data = yearly_breed, aes(x = Position, y = TotalDogs, shape = BreedType), size = 3, position = position_jitterdodge(jitter.width = 0.2)) +
geom_hline(yintercept = c(100, 500, 1000, 1500), linetype = "dashed", color = "red") + # Adding dashed lines at specified y-values
scale_fill_viridis_d(name = "District") +
scale_color_viridis_d() + # If BreedType needs color coding, this could be added/adjusted
scale_shape_manual(values = seq(1, 20)) + # Manually specifying shapes for breed type distinction
scale_size(range = c(1, 6), name = "Total Dogs per Breed Type") + # Adjusting point size to reflect total dogs per breed type
scale_y_continuous(limits = c(0, 2000), breaks = seq(0, 2000, by = 500)) + # Standardize y-axis up to 2000
theme_minimal() +
labs(title = paste("Total Count of Dogs by District in", year),
subtitle = "Points indicate count by breed type",
x = "District",
y = "Total Count of Dogs") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p)
}
In our continuing efforts to provide a comprehensive analysis of the dog populations within various districts, our latest R script has been enhanced to include not only total counts by district but also a detailed breakdown by breed type and gender.
# Adjust DistrictSort to have levels from 1 to 12 as initially indicated in new_df
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))
# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)
for (year in unique_years) {
# Calculate total number of dogs per district for each year
yearly_total <- new_df %>%
filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
group_by(DistrictSort) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop')
# Calculate total number of dogs per district per breed type and gender
yearly_breed_gender <- new_df %>%
filter(KeyDateYear == year, !is.na(DistrictSort), BreedType != "Unknown") %>%
group_by(DistrictSort, BreedType, DogSex) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
ungroup() %>%
mutate(Position = as.numeric(DistrictSort) + ifelse(DogSex == "female", -0.2, 0.2)) # Adjust position slightly for clarity
# Plotting both total counts and counts by breed type and gender
p <- ggplot() +
geom_bar(data = yearly_total, aes(x = DistrictSort, y = TotalDogs, fill = DistrictSort), stat = "identity") +
geom_point(data = yearly_breed_gender, aes(x = Position, y = TotalDogs, color = DogSex, shape = BreedType), size = 3, position = position_jitterdodge(jitter.width = 0.1)) +
geom_hline(yintercept = c(100, 500, 1000, 1500), linetype = "dashed", color = "red") +
scale_fill_viridis_d(name = "District") +
scale_color_manual(values = c("female" = "pink", "male" = "blue")) +
scale_shape_manual(values = seq(1, 20)) +
scale_y_continuous(limits = c(0, 2000), breaks = seq(0, 2000, by = 500)) +
theme_minimal() +
labs(title = paste("Total Count of Dogs by District, Breed, and Gender in", year),
subtitle = "Bar: Total Count | Points: Count by Breed and Gender",
x = "District",
y = "Total Count of Dogs") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p)
}
## Warning: `position_jitterdodge()` requires non-overlapping x intervals
## Warning: `position_jitterdodge()` requires non-overlapping x intervals
## Warning: `position_jitterdodge()` requires non-overlapping x intervals
## Warning: `position_jitterdodge()` requires non-overlapping x intervals
## Warning: `position_jitterdodge()` requires non-overlapping x intervals
## Warning: `position_jitterdodge()` requires non-overlapping x intervals
## Warning: `position_jitterdodge()` requires non-overlapping x intervals
## Warning: `position_jitterdodge()` requires non-overlapping x intervals
## Warning: `position_jitterdodge()` requires non-overlapping x intervals
##11.1. Including Unknown by year
# Data preprocessing with new transformations for new_df
new_df <- new_df %>%
mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
KeyDateYear = as.numeric(as.character(KeyDateYear)),
DogAge = KeyDateYear - DogBirthYear)
# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)
# Function to assign colors to PrimaryBreed
assign_colors <- function(data) {
n_breeds <- length(unique(data$PrimaryBreed))
palette <- scales::hue_pal()(n_breeds)
breed_color_map <- setNames(palette, unique(data$PrimaryBreed))
return(breed_color_map)
}
# Iterate over each year and create a bar plot for primary breeds, excluding NAs
for (year in unique_years) {
yearly_data <- new_df %>%
filter(KeyDateYear == year, (DogAge >= 0 & DogAge <= 15) | is.na(DogAge)) %>%
group_by(PrimaryBreed) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
arrange(desc(TotalDogs)) %>% # Arrange by TotalDogs in descending order
slice_max(order_by = TotalDogs, n = 10) # Select top 10 PrimaryBreed based on TotalDogs
# Assign colors to PrimaryBreed
breed_color_map <- assign_colors(yearly_data)
# Plot for the year showing top 10 primary breeds
p <- ggplot(yearly_data, aes(x = reorder(PrimaryBreed, TotalDogs), y = TotalDogs, fill = PrimaryBreed)) +
geom_col() + # Regular bar plot
scale_fill_manual(values = breed_color_map) + # Set manual colors for PrimaryBreed
geom_hline(yintercept = c(250, 500), linetype = "dashed", color = "red") + # Adding dashed lines at specified y-values
scale_y_continuous(limits = c(0, max(750, max(yearly_data$TotalDogs))), breaks = seq(0, max(750, max(yearly_data$TotalDogs)), by = 250)) + # Adjust y-axis based on max dog count
theme_minimal() +
labs(title = paste("Top 10 Primary Breeds by Total Count of Dogs in", year),
x = "Primary Breed",
y = "Total Number of Dogs") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p) # Display the plot
}
##11.2. Excluding Unknown by year
# Data preprocessing with new transformations for new_df
new_df <- new_df %>%
mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
KeyDateYear = as.numeric(as.character(KeyDateYear)),
DogAge = KeyDateYear - DogBirthYear)
# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)
# Function to assign colors to PrimaryBreed
assign_colors <- function(data) {
n_breeds <- length(unique(data$PrimaryBreed))
palette <- scales::hue_pal()(n_breeds)
breed_color_map <- setNames(palette, unique(data$PrimaryBreed))
return(breed_color_map)
}
# Iterate over each year and create a bar plot for primary breeds, excluding NAs and 'Unknown'
for (year in unique_years) {
yearly_data <- new_df %>%
filter(KeyDateYear == year, (DogAge >= 0 & DogAge <= 15) | is.na(DogAge), PrimaryBreed != "Unknown") %>%
group_by(PrimaryBreed) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
arrange(desc(TotalDogs)) %>% # Arrange by TotalDogs in descending order
slice_max(order_by = TotalDogs, n = 10) # Select top 10 PrimaryBreeds based on TotalDogs
# Assign colors to PrimaryBreed
breed_color_map <- assign_colors(yearly_data)
# Plot for the year showing top 10 primary breeds
p <- ggplot(yearly_data, aes(x = reorder(PrimaryBreed, TotalDogs), y = TotalDogs, fill = PrimaryBreed)) +
geom_col() + # Regular bar plot
scale_fill_manual(values = breed_color_map) + # Set manual colors for PrimaryBreed
geom_hline(yintercept = c(250, 500), linetype = "dashed", color = "red") + # Adding dashed lines at specified y-values
scale_y_continuous(limits = c(0, max(750, max(yearly_data$TotalDogs, na.rm = TRUE))), breaks = seq(0, max(750, max(yearly_data$TotalDogs, na.rm = TRUE)), by = 250)) + # Adjust y-axis based on max dog count
theme_minimal() +
labs(title = paste("Top 10 Primary Breeds by Total Count of Dogs in", year),
x = "Primary Breed",
y = "Total Number of Dogs") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p) # Display the plot
}
##11.3. Including Unknown by district
R script designed to analyze and visualize the distribution of the top five primary dog breeds in each district annually.
# Data preprocessing with new transformations for new_df
new_df <- new_df %>%
mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
KeyDateYear = as.numeric(as.character(KeyDateYear)),
DogAge = KeyDateYear - DogBirthYear)
# Ensure 'DistrictSort' is set as a factor with levels from 1 to 12 for proper ordering
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))
# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)
# Function to assign colors to PrimaryBreed
assign_colors <- function(data) {
n_breeds <- length(unique(data$PrimaryBreed))
palette <- scales::hue_pal()(n_breeds)
breed_color_map <- setNames(palette, unique(data$PrimaryBreed))
return(breed_color_map)
}
# Iterate over each year and create a bar plot for districts 1 through 12, excluding NAs
for (year in unique_years) {
yearly_data <- new_df %>%
filter(KeyDateYear == year, !is.na(DistrictSort), (DogAge >= 0 & DogAge <= 15) | is.na(DogAge)) %>%
group_by(DistrictSort, PrimaryBreed) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
arrange(DistrictSort, desc(TotalDogs)) %>% # Arrange by DistrictSort and TotalDogs in descending order
group_by(DistrictSort) %>%
top_n(6, wt = TotalDogs) %>% # Select top 5 PrimaryBreed for each DistrictSort based on TotalDogs
ungroup()
# Assign colors to PrimaryBreed
breed_color_map <- assign_colors(yearly_data)
# Plot for the year with districts 1 through 12
p <- ggplot(yearly_data, aes(x = DistrictSort, y = TotalDogs, fill = PrimaryBreed)) +
geom_col(position = "stack") + # Stacked bar plot
scale_fill_manual(values = breed_color_map) + # Set manual colors for PrimaryBreed
geom_hline(yintercept = c(250, 500), linetype = "dashed", color = "red") + # Adding dashed lines at specified y-values
scale_y_continuous(limits = c(0, 750), breaks = seq(0, 750, by = 250)) + # Standardize y-axis up to 2000
theme_minimal() +
labs(title = paste("Top 5 Primary Breeds by Total Count of Dogs in", year),
x = "District",
y = "Total Number of Dogs") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p) # Display the plot
}
Explicitly excluding any entries categorized under “Unknown.” This refined focus allows for a more precise and meaningful understanding of breed popularity and distribution, crucial for targeted animal welfare and urban planning strategies.
# Data preprocessing with new transformations for new_df
new_df <- new_df %>%
mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
KeyDateYear = as.numeric(as.character(KeyDateYear)),
DogAge = KeyDateYear - DogBirthYear)
# Ensure 'DistrictSort' is set as a factor with levels from 1 to 12 for proper ordering
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))
# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)
# Function to assign colors to PrimaryBreed
assign_colors <- function(data) {
n_breeds <- length(unique(data$PrimaryBreed))
palette <- scales::hue_pal()(n_breeds)
breed_color_map <- setNames(palette, unique(data$PrimaryBreed))
return(breed_color_map)
}
# Iterate over each year and create a bar plot for districts 1 through 12, excluding NAs and "Unknown" breeds
for (year in unique_years) {
yearly_data <- new_df %>%
filter(KeyDateYear == year, !is.na(DistrictSort), (DogAge >= 0 & DogAge <= 15) | is.na(DogAge), PrimaryBreed != "Unknown") %>%
group_by(DistrictSort, PrimaryBreed) %>%
summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
arrange(DistrictSort, desc(TotalDogs)) %>% # Arrange by DistrictSort and TotalDogs in descending order
group_by(DistrictSort) %>%
top_n(6, wt = TotalDogs) %>% # Select top 5 PrimaryBreed for each DistrictSort based on TotalDogs
ungroup()
# Assign colors to PrimaryBreed
breed_color_map <- assign_colors(yearly_data)
# Plot for the year with districts 1 through 12
p <- ggplot(yearly_data, aes(x = DistrictSort, y = TotalDogs, fill = PrimaryBreed)) +
geom_col(position = "stack") + # Stacked bar plot
scale_fill_manual(values = breed_color_map) + # Set manual colors for PrimaryBreed
geom_hline(yintercept = c(250, 500), linetype = "dashed", color = "red") + # Adding dashed lines at specified y-values
scale_y_continuous(limits = c(0, 750), breaks = seq(0, 750, by = 250)) + # Standardize y-axis up to 750
theme_minimal() +
labs(title = paste("Top 6 Primary Breeds by Total Count of Dogs in", year),
x = "District",
y = "Total Number of Dogs") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p) # Display the plot
}
To analyze the aging patterns of dogs from 2015 to 2023, we first preprocess the dataset to ensure the necessary variables are in the appropriate format. This includes converting DogAgeGroupCd to numeric and KeyDateYear to numeric. After this preprocessing step, we proceed with the following steps:
Identifying Dogs Present in 2015 and 2023: We filter the dataset to extract information about dogs present in 2015 and 2023, focusing on their OwnerId, PrimaryBreed, and SecondaryBreed.
Finding Dogs Present in Both Years: We find the intersection of dogs present in 2015 and 2023 to identify those that survived from 2015 to 2023.
Filtering Dataset for Surviving Dogs: Using the information obtained from the previous step, we filter the original dataset to retain records of dogs present in both 2015 and 2023.
Checking for Consistent Age Progression: We calculate the age difference for each dog between 2015 and 2023 and identify dogs with a consistent age progression of 8 years, assuming that DogAge reflects each dog’s age accurately.
RESULT: This data suggests that many dogs observed in both 2015 and 2023 have aged consistently by 8 years, indicating a typical aging pattern over the observed period.
# Convert to numeric and prepare new_df
new_df <- new_df %>%
mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
KeyDateYear = as.numeric(as.character(KeyDateYear)),
DogAge = KeyDateYear - DogBirthYear)
# Step 1: Identify dogs present in 2015
dogs_in_2015 <- new_df %>%
filter(KeyDateYear == 2015) %>%
select(OwnerId, PrimaryBreed, SecondaryBreed)
# Step 2: Identify dogs present in 2023
dogs_in_2023 <- new_df %>%
filter(KeyDateYear == 2023) %>%
select(OwnerId, PrimaryBreed, SecondaryBreed)
# Step 3: Intersect the two groups to find dogs present in both years
surviving_dogs <- intersect(dogs_in_2015, dogs_in_2023)
# Filter the original dataset for these dogs and years
surviving_dogs_records <- new_df %>%
semi_join(surviving_dogs, by = c("OwnerId", "PrimaryBreed", "SecondaryBreed")) %>%
filter(KeyDateYear %in% c(2015, 2023))
# Check for consistent age progression for these dogs
surviving_dogs_age_check <- surviving_dogs_records %>%
group_by(OwnerId, PrimaryBreed, SecondaryBreed) %>%
summarise(AgeDifference = diff(sort(DogAge)), .groups = "drop")
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Dogs with consistent age progression
consistent_age_progression <- surviving_dogs_age_check %>%
filter(AgeDifference == 8)
print(consistent_age_progression)
## # A tibble: 1,670 × 4
## OwnerId PrimaryBreed SecondaryBreed AgeDifference
## <chr> <chr> <chr> <dbl>
## 1 100002 Chihuahua none 8
## 2 100018 Zwergschnauzer none 8
## 3 100030 Chihuahua none 8
## 4 100034 Unknown Unknown 8
## 5 100052 Jack Russel Terrier none 8
## 6 100060 Chihuahua none 8
## 7 100147 Unknown Unknown 8
## 8 100181 Malteser none 8
## 9 100181 Zwergpudel none 8
## 10 100184 Mops none 8
## # ℹ 1,660 more rows
R code used to track the aging progression of dogs from 2015 to 2023 while excluding those with an ‘Unknown’ primary breed. This process involves identifying dogs present in both 2015 and 2023, filtering out those with an ‘Unknown’ primary breed, and then checking for consistent age progression among these dogs. The results are printed to examine dogs with an expected age difference of 8 years between 2015 and 2023.
RESULT: ID 100002 has a Chihuahua with no secondary breed, and the age of this dog has increased by 8 years from 2015 to 2023. Similarly, each subsequent row provides information about different dogs owned by different individuals, their breeds, and the corresponding age differences over the specified period.
# Convert to numeric and prepare new_df
new_df <- new_df %>%
mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
KeyDateYear = as.numeric(as.character(KeyDateYear)),
DogAge = KeyDateYear - DogBirthYear)
# Step 1: Identify dogs present in 2015, excluding 'Unknown' breeds
dogs_in_2015 <- new_df %>%
filter(KeyDateYear == 2015, PrimaryBreed != "Unknown", SecondaryBreed != "Unknown") %>%
select(OwnerId, PrimaryBreed, SecondaryBreed)
# Step 2: Identify dogs present in 2023, excluding 'Unknown' breeds
dogs_in_2023 <- new_df %>%
filter(KeyDateYear == 2023, PrimaryBreed != "Unknown", SecondaryBreed != "Unknown") %>%
select(OwnerId, PrimaryBreed, SecondaryBreed)
# Step 3: Intersect the two groups to find dogs present in both years
surviving_dogs <- intersect(dogs_in_2015, dogs_in_2023)
# Filter the original dataset for these dogs and years
surviving_dogs_records <- new_df %>%
semi_join(surviving_dogs, by = c("OwnerId", "PrimaryBreed", "SecondaryBreed")) %>%
filter(KeyDateYear %in% c(2015, 2023))
# Check for consistent age progression for these dogs
surviving_dogs_age_check <- surviving_dogs_records %>%
group_by(OwnerId, PrimaryBreed, SecondaryBreed) %>%
summarise(AgeDifference = diff(sort(DogAge)), .groups = "drop")
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Dogs with consistent age progression
consistent_age_progression_2 <- surviving_dogs_age_check %>%
filter(AgeDifference == 8)
print(consistent_age_progression_2)
## # A tibble: 1,340 × 4
## OwnerId PrimaryBreed SecondaryBreed AgeDifference
## <chr> <chr> <chr> <dbl>
## 1 100002 Chihuahua none 8
## 2 100018 Zwergschnauzer none 8
## 3 100030 Chihuahua none 8
## 4 100052 Jack Russel Terrier none 8
## 5 100060 Chihuahua none 8
## 6 100181 Malteser none 8
## 7 100181 Zwergpudel none 8
## 8 100184 Mops none 8
## 9 100192 Gordon Setter none 8
## 10 100220 Chihuahua none 8
## # ℹ 1,330 more rows
These codes calculates and visualizes the count of surviving dogs from 2015 to 2023 for the top 10 breeds based on their occurrence across all years.
By first determining the breeds with the highest occurrence across all years, it ensures a comprehensive selection. Then, filtering the dataset for dogs with consistent age progression, it narrows down the focus to these top breeds. The resulting bar plot vividly illustrates the distribution of surviving dogs among these breeds, providing valuable insights into their prevalence over the specified timeframe.
# Calculate the top 10 breeds based on their occurrence across all years in new_df
top_10_breeds <- new_df %>%
count(PrimaryBreed, sort = TRUE) %>%
slice_max(order_by = n, n = 10) %>%
pull(PrimaryBreed)
# Assuming 'consistent_age_progression' is a subset of new_df reflecting dogs from 2015 to 2023
# that survived and are in the top 10 breeds
consistent_age_progression <- new_df %>%
filter(PrimaryBreed %in% top_10_breeds, KeyDateYear %in% c(2015, 2023)) %>%
# Add any additional filters here for survival or presence in both years if not already filtered
group_by(OwnerId, PrimaryBreed) %>%
filter(n() > 1) %>%
ungroup()
# Pre-calculate the counts for plotting and ordering
top_breeds_counts <- consistent_age_progression %>%
count(PrimaryBreed) %>%
arrange(desc(n))
# Visualize the count of surviving dogs from 2015 to 2023 in the top 10 breeds
ggplot(top_breeds_counts, aes(x = reorder(PrimaryBreed, n), y = n, fill = PrimaryBreed)) +
geom_col() +
theme_minimal() +
labs(title = "Surviving Dogs from 2015 to 2023 by Top 10 Breeds",
x = "Breed", y = "Count of Surviving Dogs") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_viridis_d() +
theme(legend.position = "none")
This code segment performs a refined analysis of aging dogs from 2015 to 2023, focusing specifically on the top 10 breeds while excluding any entries labeled as ‘Unknown’. By filtering out these unknown entries from the dataset before calculating the breed occurrences, it ensures a more accurate representation of the prevailing breeds.
# Calculate the top 10 breeds based on their occurrence across all years in new_df
top_10_breeds <- new_df %>%
count(PrimaryBreed, sort = TRUE) %>%
filter(PrimaryBreed != "Unknown") %>%
slice_max(order_by = n, n = 10) %>%
pull(PrimaryBreed)
# Assuming 'consistent_age_progression' is a subset of new_df reflecting dogs from 2015 to 2023
# that survived and are in the top 10 breeds
consistent_age_progression <- new_df %>%
filter(PrimaryBreed %in% top_10_breeds, KeyDateYear %in% c(2015, 2023)) %>%
# Add any additional filters here for survival or presence in both years if not already filtered
group_by(OwnerId, PrimaryBreed) %>%
filter(n() > 1) %>%
ungroup()
# Pre-calculate the counts for plotting and ordering
top_breeds_counts <- consistent_age_progression %>%
count(PrimaryBreed) %>%
arrange(desc(n))
# Visualize the count of surviving dogs from 2015 to 2023 in the top 10 breeds
ggplot(top_breeds_counts, aes(x = reorder(PrimaryBreed, n), y = n, fill = PrimaryBreed)) +
geom_col() +
theme_minimal() +
labs(title = "Surviving Dogs from 2015 to 2023 by Top 10 Breeds",
x = "Breed", y = "Count of Surviving Dogs") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_viridis_d() +
theme(legend.position = "none")